"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." The Bitter Lesson, Rich Sutton, 2019.
GloVe dates from 2014. All relevant information can be found on the project site hosted at Stanford.
The algorithm is rather simple:
"The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus."
The basic idea applies to word2vec as well as to GloVe:
This is done with so-called matrix factorization; the matrix is the co-occurrence matrix of words in documents.
The example below is a typical case of collaborative filtering. Matrix factorization became very popular in the recommender-system community through the 1-million-dollar Netflix Prize challenge.
At the start, each word (item, user) is given a random embedding vector. The scalar product (dot product) between two vectors should reconstruct the content of the corresponding matrix cell. The error, i.e. the difference between the scalar product and the actual cell content, is propagated back into the embedding vectors, which are adapted accordingly.
The resulting embeddings are able to reconstruct the word co-occurrence matrix (or the ratings a user gave a certain movie).
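The procedure described above can be sketched in a few lines of NumPy. The toy matrix, the embedding dimension, the learning rate, and the epoch count are all illustrative assumptions, not values from GloVe or the Netflix challenge (GloVe additionally trains only on the non-zero entries):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "co-occurrence" (or rating) matrix that the embeddings should reconstruct.
M = np.array([[4.0, 0.0, 2.0],
              [1.0, 3.0, 0.0],
              [0.0, 2.0, 5.0]])

n_rows, n_cols = M.shape
dim = 3    # embedding dimension (illustrative)
lr = 0.01  # learning rate

# 1) Every row/column entity starts with a random embedding vector.
U = rng.normal(scale=0.1, size=(n_rows, dim))
V = rng.normal(scale=0.1, size=(n_cols, dim))

# 2) SGD: the dot product U[i] . V[j] should reconstruct cell M[i, j];
#    the error is propagated back into both embedding vectors.
for _ in range(5000):
    for i in range(n_rows):
        for j in range(n_cols):
            err = M[i, j] - U[i] @ V[j]
            U[i] += lr * err * V[j]
            V[j] += lr * err * U[i]

print(np.round(U @ V.T, 1))  # close to M
```

Note that this toy fits every cell, including the zeros; GloVe, as quoted above, uses only the non-zero entries (with an additional frequency weighting).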
CBOW: take the embeddings of the surrounding words and try to predict the masked (missing) word in the middle.
Skip-Gram: Take the embedding of the word in the middle and try to predict the words around it.
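The two objectives are easiest to see in the training pairs they generate. A pure-Python sketch (the sentence and the window size of 2 are made up):

```python
# Toy sentence and a context window of 2 words on each side.
words = "the quick brown fox jumps over the lazy dog".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(words):
    context = [words[j] for j in range(max(0, i - window),
                                       min(len(words), i + window + 1))
               if j != i]
    # CBOW: the surrounding words predict the (masked) center word.
    cbow_pairs.append((context, center))
    # Skip-gram: the center word predicts each surrounding word.
    skipgram_pairs.extend((center, c) for c in context)

print(cbow_pairs[3])       # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(skipgram_pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
```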
FastText paper.
But a more approachable explanation can be found here:
While GloVe and word2vec work on the word level, FastText works on the character n-gram level. In this way it learns the internal structure of words. Thus, FastText has no out-of-vocabulary words (words not present during training) and is able to infer similarities from shared word stems.
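FastText's subword idea can be sketched in a few lines. The `<`/`>` boundary markers and the 3-to-6-character n-gram range follow the FastText paper; everything else here is a toy:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with < and > marking the word boundaries
    (as in FastText). A word's vector is the sum of its n-gram vectors,
    so even unseen words get a representation."""
    w = f"<{word}>"
    grams = {w}  # the full word itself is also kept as one unit
    for n in range(n_min, n_max + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

# "walking" and "walked" share many n-grams -> their vectors end up similar.
shared = char_ngrams("walking") & char_ngrams("walked")
print(sorted(shared))  # includes '<wa', '<wal', '<walk', 'walk', ...
```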
BERT is a classic transformer encoder. Instead of predicting the next word in a sentence (as done with recurrent neural networks such as LSTMs), it predicts the masked words (about 15% of the tokens). The information of the visible words is shared among all positions in the network.
The [CLS] token is prepended to every input sequence (its output embedding is often used for sentence classification).
The [MASK] token marks the positions where the correct word has to be guessed; [PAD] just fills all input sentences to the same length. This is more efficient since sentences can be batched together.
The classification head on the left is used during training. For inference, the cosine similarity between the output embeddings of Sentence A and Sentence B is computed (right side).
In each training step, there is an anchor sentence and a positive example that is semantically equivalent to the anchor sentence. Moreover, there are negative examples that are just random sentences not similar to the anchor sentence.
The so-called triplet loss pushes the anchor and the positive embedding closer to each other (cosine similarity of 1) and pushes the anchor and the negative embedding, as well as the positive and the negative embedding, further away from each other (Euclidean or cosine distance).
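With a distance function $d$ (Euclidean or cosine distance), the triplet loss described above is commonly written as

$$\mathcal{L} = \max\bigl(0,\; d(a, p) - d(a, n) + \alpha\bigr)$$

with anchor $a$, positive example $p$, and negative example $n$,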
where $\alpha$ is the margin, the amount by which negative examples have to be further away from the anchor than positive examples.
This is the publication on this ingenious idea.
The student model is XLM-R from Facebook.
FLOPS: FLoating-point OPerations per Second. What is a petaFLOPS (PFLOPS)? $10^{15}$ operations per second.
"To match what a 1 PFLOPS computer system can do in just one second, you'd have to perform one calculation every second for 31,688,765 years." taken from here, pics are taken from here
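The claim is easy to check with a few lines of arithmetic (the exact year count depends on the assumed length of a year; with 365.25 days we land slightly off the quoted figure, but in the same ballpark):

```python
ops_in_one_second = 1e15               # operations a 1 PFLOPS machine does per second
seconds_per_year = 365.25 * 24 * 3600  # Julian year
years = ops_in_one_second / seconds_per_year  # one hand calculation per second
print(f"{years:,.0f} years")           # roughly 31.7 million years
```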
### The number of parameters is growing faster than the memory of the accelerators
The size of Transformer models grows by 240× every two years
pics are taken from [here](https://github.com/amirgholami/ai_and_memory_wall/tree/main/imgs/pngs)
taken from this podcast with George Hotz
taken from: https://arxiv.org/pdf/1706.03762.pdf
Important things to note are:
Why Self-Attention?
Self-attention allows the model to weight the importance of every other word when building the representation of each word:
taken from http://lucasb.eyer.be/transformer
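This weighting can be written down in a few lines of NumPy. The sketch below shows scaled dot-product self-attention for a single head without masking; the dimensions and random inputs are illustrative:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model); returns (seq_len, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V  # each output is a weighted mix of all positions

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(5, d_model))                     # 5 tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Multi-head attention simply runs several such heads with their own projection matrices and concatenates the results.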
Multi-Head Self-Attention
Transformer models do not see single characters:
taken from https://platform.openai.com/tokenizer
Architecture of some important Transformer models
taken from Lukas Beyer again
taken from Andrej Karpathy
Colossal Cleaned Crawled Corpus (C4) This is 800GB of cleaned common internet crawl. https://github.com/google-research/text-to-text-transfer-transformer#c4
BookCorpus "The books have been crawled from https://www.smashwords.com, see their terms of service for more information."
Stack-Exchange preferences
Instruction Data-Sets
taken from Lilian Weng
Data-Sets "stolen" from ChatGPT
The project can be found here
Reinforcement-Learning with human feedback (RLHF)
Chain-Of-Thought (COT) Training
Lineage of Chat-GPT
taken from: How does GPT Obtain its Ability?
taken from Andrej Karpathy
Distilling ChatGPT: "Our data generation process results in 52K unique instructions and the corresponding outputs, which cost less than $500 using the OpenAI API."
Here is a link to the 'open source' models and their performance.
LORA (Low-Rank-Adaptation)
This is corresponding paper.
The pretrained weight matrix $\mathbf{W}$ is frozen during training. An additional weight update is trained as the product of the two low-rank (rank $r$) matrices $\mathbf{A}$ and $\mathbf{B}$. Only these weights (orange) are updated. The input vector (dark blue) is multiplied with the frozen weights as well as with the low-rank adaptation of the weight matrix; the results are simply added.
During training only the gradients for the orange matrices have to be kept in GPU-memory.
taken from Sebastian Raschka
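The LoRA forward pass described above as a NumPy sketch. The zero initialization of $\mathbf{B}$ and the $\alpha/r$ scaling follow the LoRA paper; the concrete dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 2, 4         # r << d: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))           # pretrained weights: frozen
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init -> BA = 0

x = rng.normal(size=d_in)                    # input vector

# Forward pass: frozen path plus low-rank path, results simply added.
h = W @ x + (alpha / r) * (B @ (A @ x))

# Because B starts at zero, LoRA leaves the model unchanged at initialization.
print(np.allclose(h, W @ x))  # True
```

Only $\mathbf{A}$ and $\mathbf{B}$ ($2 \cdot r \cdot d$ parameters instead of $d^2$) receive gradients, which is why so little GPU memory is needed.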
Bits and Bytes
Tim Dettmers et al., 2022
QLora by Dettmers et al., 2023
In short, QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. This method enables 33B model finetuning on a single 24GB GPU and 65B model finetuning on a single 48GB GPU. see here
illustration taken from here
SpQR by Dettmers et al., 2023. This is not for training but for inference.
"Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x."
Prompt engineering will probably not remain a job of its own.
taken from Andrej Karpathy
From Mishra et al., 2022:
Prompting just tries to imitate the training data as well as possible.
Fiction
Since most models are also trained on one or several book corpora, they can also be prompted to take on a fictional persona.
Zero-Shot CoT prompting
Let's think step by step
Remember Byte-Pair-Encoding:
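A minimal byte-pair-encoding sketch (toy corpus, character-level start; real tokenizers such as GPT's operate on bytes and far larger corpora):

```python
from collections import Counter

def learn_bpe(words, n_merges):
    """Learn merge rules: repeatedly fuse the most frequent adjacent pair."""
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(n_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere in the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    merged.append(symbols[i]); i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges, vocab

corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, vocab = learn_bpe(corpus, n_merges=4)
print(merges)  # frequent pairs like ('l', 'o') and ('lo', 'w') merge first
```

Because merges are driven purely by frequency, rare strings (unusual words, long numbers) get chopped into several tokens while frequent strings become single tokens.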
For more funny examples with "O" see this Twitter feed.
Why Large Language Models cannot calculate with large numbers:
The rise of vector databases
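At its core, a vector database stores embedding vectors and answers nearest-neighbour queries, e.g. by cosine similarity. A brute-force NumPy sketch with random stand-in embeddings (production systems use approximate indexes such as HNSW instead of a full scan):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Database": one embedding per stored document (dimensions made up).
doc_vecs = rng.normal(size=(1000, 64))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def top_k(query, k=3):
    """Indices of the k documents most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    sims = doc_vecs @ q               # cosine similarity (unit vectors)
    return np.argsort(sims)[::-1][:k]

query = doc_vecs[42] + 0.01 * rng.normal(size=64)  # near-duplicate of doc 42
print(top_k(query))  # doc 42 ranks first
```

This is also how long documents are worked around today: chunks are embedded once, and only the retrieved top-k chunks are placed into the model's limited context.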
New research is extending the context length
User: Here is the first book of Harry Potter. (book in looong context) Read it and describe the main characters!
ChatBot:
Harry Potter, the protagonist of the story, who is marked with a lightning scar and destined to fight the evil Lord Voldemort
Hermione Granger, Harry’s best friend and a brilliant witch who excels at academics and logic
Ron Weasley, Harry’s other best friend and a loyal and brave wizard from a large family
Albus Dumbledore, the headmaster of Hogwarts and a powerful and wise wizard who guides and mentors Harry
Lord Voldemort, the main antagonist of the story, who is a dark wizard who seeks to destroy Harry and conquer the wizarding world
User: Now, explain in more detail the role of Albus Dumbledore.
ChatBot: Can you give me the book again? I first have to read it.
research in progress: agents
There are specialized models for:
(summary of Leadership needs us to do Gen AI, what do we do?)
1. Set expectations:
2. Minimize risk:
Analyze:
Have a data strategy:
Avoid big sweeping decisions:
![Is the case worth it?](../images/is_the_case_worth_it.png)
Here are 5 questions to ask to understand if a gen AI product will be successful:
1/ If you took the word "AI" out, is the product still solving a customer problem?
AI is a solution, not a problem.
Ask yourself:
2/ How accurate does the solution need to be?
Plot the problem on a fluency vs. accuracy grid.
Gen AI today is great for high fluency + low accuracy problems (e.g., productivity).
It's not great for solutions that need high accuracy (e.g., financial decisions).
3/ How fast will incumbents move?
Incumbents like Microsoft, Google, and Adobe have moved incredibly fast on AI.
Startups that overlap with core incumbent use cases might struggle.
e.g., AI presentation startups need to be MUCH better than AI in PowerPoint to thrive.
4/ Is there a moat?
Example moats include:
And of course...speed of execution.
5/ Is it overvalued?
If an AI product already has $100M+ valuation, you should think:
Can it continue to grow and (more importantly) retain users?
In a crowded space like AI copywriting and productivity - that could get hard.
6/ To recap, here are 5 questions to ask to evaluate AI products and companies:
7/ I hope these questions also help builders who are thinking of creating new AI products.